The Denario project: Deep knowledge AI agents for scientific discovery

Villaescusa-Navarro, Francisco, Bolliet, Boris, Villanueva-Domingo, Pablo, Bayer, Adrian E., Acquah, Aidan, Amancharla, Chetana, Barzilay-Siegal, Almog, Bermejo, Pablo, Bilodeau, Camille, Ramírez, Pablo Cárdenas, Cranmer, Miles, França, Urbano L., Hahn, ChangHoon, Jiang, Yan-Fei, Jimenez, Raul, Lee, Jun-Young, Lerario, Antonio, Mamun, Osman, Meier, Thomas, Ojha, Anupam A., Protopapas, Pavlos, Roy, Shimanto, Spergel, David N., Tarancón-Álvarez, Pedro, Tiwari, Ujjwal, Viel, Matteo, Wadekar, Digvijay, Wang, Chi, Wang, Bonny Y., Xu, Licong, Yovel, Yossi, Yue, Shuwen, Zhou, Wen-Han, Zhu, Qiyao, Zou, Jiajun, Zubeldia, Íñigo

arXiv.org Artificial Intelligence

We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or to carry out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe Denario and its modules in detail, and illustrate its capabilities by presenting multiple papers it generated in scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
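The modular architecture the abstract describes — independent modules that can each run a single task or be chained end to end — can be sketched as a simple pipeline. This is a hypothetical illustration, not the actual Denario API; all class and field names here are invented.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a modular research pipeline in the spirit of
# Denario: each module handles one stage (idea, plan, ...) and can be run
# on its own or chained end to end. Names are illustrative only.

@dataclass
class ResearchState:
    topic: str
    artifacts: dict = field(default_factory=dict)

class IdeaModule:
    def run(self, state: ResearchState) -> ResearchState:
        state.artifacts["idea"] = f"idea for {state.topic}"
        return state

class PlanModule:
    def run(self, state: ResearchState) -> ResearchState:
        # Later stages consume artifacts produced by earlier ones.
        state.artifacts["plan"] = f"plan based on {state.artifacts['idea']}"
        return state

def run_pipeline(modules, state):
    # A single module can be invoked in isolation, or the full chain
    # executed for end-to-end analysis.
    for m in modules:
        state = m.run(state)
    return state

state = run_pipeline([IdeaModule(), PlanModule()],
                     ResearchState("galaxy clustering"))
print(sorted(state.artifacts))  # → ['idea', 'plan']
```

The benefit of this shape is the one the abstract claims: the same system serves both narrow requests (run only `IdeaModule`) and full end-to-end analyses (run the whole chain).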


Self-Designing Software

Communications of the ACM

In computing, we have a great diversity of tools at our disposal to perform any given task. We know many different sorting algorithms, cache eviction policies, hash functions, compression algorithms, scheduling approaches, and so on. And at a higher level, we have many ways to combine these alternatives into suitable systems. Many trainees in software engineering will have learned these alternatives and have a set of heuristics they can draw on to make reasonable design choices for each new problem. Selecting the right set of tools for a new task combines engineering experience, which narrows the initial design options, with deployment-based feedback, which fine-tunes (or sometimes entirely redesigns) the solution.
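One of the alternatives the excerpt names, cache eviction policies, makes a compact example of such interchangeable design choices. The sketch below (my own illustration, not from the article) puts FIFO and LRU eviction behind the same interface, so the policy can be swapped when deployment feedback favors one workload pattern over another.

```python
from collections import OrderedDict

# Two interchangeable cache eviction policies behind one interface.
# OrderedDict keeps insertion order, which FIFO uses directly; LRU
# refreshes an entry's position on access so the front is least recent.

class FIFOCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict oldest insertion
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class LRUCache(FIFOCache):
    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # refresh recency on access
        return self.data.get(key)
    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        super().put(key, value)

lru = LRUCache(2)
lru.put("a", 1); lru.put("b", 2)
lru.get("a")        # touching "a" makes "b" least recently used
lru.put("c", 3)     # under LRU this evicts "b"; under FIFO it would evict "a"
print(lru.get("b"), lru.get("a"))  # → None 1
```

The point is the article's: both policies are reasonable, and only experience plus deployment feedback tells you which one a given workload rewards.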


The Fall of ROME: Understanding the Collapse of LLMs in Model Editing

Yang, Wanli, Sun, Fei, Tan, Jiajun, Ma, Xinyu, Su, Du, Yin, Dawei, Shen, Huawei

arXiv.org Artificial Intelligence

Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it can disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution differs significantly from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during the editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
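The small-denominator mechanism the abstract identifies can be shown with toy numbers. The sketch below is my own simplification, not the paper's code: it abstracts the ROME-style rank-one update as a scale of roughly 1 / (k_denom · k_edit). When both keys come from the same (prefixed) distribution the denominator is healthy; mixing in an unprefixed key whose direction differs makes the denominator tiny and the update explode.

```python
# Toy numeric sketch (not the paper's code) of the inconsistent-key issue:
# a rank-one edit whose magnitude scales like 1 / (k_denom . k_edit).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_magnitude(residual, k_edit, k_denom):
    # Simplified edit scale: |residual| / (k_denom . k_edit).
    return abs(residual) / dot(k_denom, k_edit)

k_prefixed   = [0.9, 0.1]    # key averaged over prefixed contexts
k_unprefixed = [0.05, -0.4]  # bare first-token key, nearly orthogonal

consistent   = update_magnitude(1.0, k_prefixed, k_prefixed)    # ~1.2
inconsistent = update_magnitude(1.0, k_prefixed, k_unprefixed)  # ~200
print(consistent < inconsistent)  # → True
```

Under this toy model, the paper's remedy amounts to always feeding the same prefixed key into both slots, which keeps the denominator well away from zero.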


Differential testing for machine learning: an analysis for classification algorithms beyond deep learning

Herbold, Steffen, Tunkel, Steffen

arXiv.org Artificial Intelligence

Context: Differential testing is a useful approach that uses different implementations of the same algorithms and compares the results for software testing. In recent years, this approach was successfully used for test campaigns of deep learning frameworks. Objective: There is little knowledge on the application of differential testing beyond deep learning. Within this article, we want to close this gap for classification algorithms. Method: We conduct a case study using Scikit-learn, Weka, Spark MLlib, and Caret in which we identify the potential of differential testing by considering which algorithms are available in multiple frameworks, the feasibility by identifying pairs of algorithms that should exhibit the same behavior, and the effectiveness by executing tests for the identified pairs and analyzing the deviations. Results: While we found a large potential for popular algorithms, the feasibility seems limited because it is often not possible to determine configurations that are the same in other frameworks. The execution of the feasible tests revealed a large number of deviations in both the scores and the predicted classes. Only a lenient oracle based on the statistical significance of class differences avoids a huge number of test failures. Conclusions: The potential of differential testing beyond deep learning seems limited for research into the quality of machine learning libraries. Practitioners may still use the approach if they have deep knowledge about the implementations, especially if a coarse oracle that only considers significant differences between classes is sufficient.
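The core idea of the study — two independent implementations of the same algorithm serving as oracles for each other — can be sketched in a few lines. This is an illustrative harness, not the paper's; it pairs two pure-Python 1-nearest-neighbor implementations and uses a lenient oracle that only counts disagreements in the predicted class, ignoring score differences.

```python
import math

# Differential-testing sketch (illustrative, not the study's harness):
# two independent 1-NN implementations are compared on the same inputs.

def knn1_a(train, labels, x):
    # Implementation A: explicit loop over squared Euclidean distances.
    best_i, best_d = 0, float("inf")
    for i, t in enumerate(train):
        d = sum((a - b) ** 2 for a, b in zip(t, x))
        if d < best_d:
            best_i, best_d = i, d
    return labels[best_i]

def knn1_b(train, labels, x):
    # Implementation B: min() over Euclidean distances via math.dist.
    i = min(range(len(train)), key=lambda j: math.dist(train[j], x))
    return labels[i]

def differential_test(train, labels, tests):
    # Lenient oracle: count only class disagreements between the two
    # implementations, the coarse criterion the study finds workable.
    return sum(knn1_a(train, labels, x) != knn1_b(train, labels, x)
               for x in tests)

train = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
labels = ["a", "b", "a"]
tests = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.6)]
print(differential_test(train, labels, tests))  # → 0
```

The study's feasibility problem shows up even here: the two implementations only agree because they share distance metric, tie-breaking, and data representation — exactly the configuration details that are hard to align across frameworks like Scikit-learn and Weka.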